Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642
Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642johntaylor-cell wants to merge 5 commits into
Conversation
…ontrol inclusion of high-cardinality tags from controller metrics Signed-off-by: john.taylor <john.taylor@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a feature flag RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to conditionally drop high-cardinality tags (such as replica and handle IDs) from controller-emitted metrics, helping to reduce metric volume in large deployments. The reviewer suggests optimizing the helper function _get_autoscaling_metrics_delay_tags by passing the high-cardinality key and value as separate arguments instead of a dictionary. This avoids unnecessary dictionary allocations on the hot path when high-cardinality tags are disabled. Corresponding updates are also suggested for the call sites and unit tests.
Signed-off-by: john.taylor <john.taylor@anyscale.com>
|
please link the GH issue this fixes in the description |
abrarsheikh
left a comment
There was a problem hiding this comment.
there are other metrics that controller emits that includes replica. Let's apply the change to those as well
https://docs.ray.io/en/latest/serve/monitoring.html#replica-lifecycle-metrics
https://docs.ray.io/en/latest/serve/monitoring.html#autoscaling-metrics
https://docs.ray.io/en/latest/serve/monitoring.html#autoscaling-metrics
Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
|
Added ability to disable replica metrics in |
|
|
||
| def _get_autoscaling_metrics_delay_tag_keys( | ||
| high_cardinality_tag_keys: Tuple[str, ...], | ||
| ) -> Tuple[str, ...]: |
There was a problem hiding this comment.
why Tuple[str, ...] instead of str?
|
|
||
| logger = logging.getLogger(SERVE_LOGGER_NAME) | ||
|
|
||
| _REPLICA_METRIC_BASE_TAG_KEYS = ("deployment", "application") |
There was a problem hiding this comment.
we can exclude the changes in this file because these metrics are being emitted from the replica processes and not the controller. As a result it doesn't suffer from the same metric volume problem.
| import pytest | ||
|
|
||
| from ray.serve._private.common import TargetCapacityDirection | ||
| from ray.serve._private.common import DeploymentID, TargetCapacityDirection |
There was a problem hiding this comment.
let's use the test_metrics.py to test this out instead of relying on unit testing internal private functions.
test_metrics is a end to end test.
|
|
||
| from ray._common.ray_constants import DEFAULT_MAX_CONCURRENCY_ASYNC | ||
| from ray._raylet import NodeID | ||
| from ray.serve._private import ( |
There was a problem hiding this comment.
same comment here, let's use end to end tests.
| "healthy, 0 means unhealthy." | ||
| ), | ||
| tag_keys=("deployment", "replica", "application"), | ||
| tag_keys=_get_replica_lifecycle_metric_tag_keys(), |
There was a problem hiding this comment.
nit: my personal preference is to just inline the logic to include high cardinality tags or not. current layout is not really achieving write-once-use-everywhere, we are still having to repeat this logic in various places. Also the logic is a bit hard to read/duplicatedwith tags/tag-keys.
This PR addresses his issue: #63114
API/Schema Changes
Added env var RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS, default 1.
When set to 0, Serve controller autoscaling delay metrics drop high-cardinality source tags:
handle from ray_serve_autoscaling_handle_metrics_delay_ms and replica from ray_serve_autoscaling_replica_metrics_delay_ms.
Internal Call Sites
Updated ServeController metric gauge construction and emission in python/ray/serve/_private/controller.py.
Added helpers to keep gauge tag_keys and emitted tags consistent based on the env var.
Tests Added
Added unit coverage in python/ray/serve/tests/unit/test_controller.py for both modes:
default includes handle / replica
opt-out keeps only deployment / application
Tests Run
git diff --check
python -m py_compile changed Python files
python -m pytest python/ray/serve/tests/unit/test_controller.py -q -k autoscaling_metrics_delay_tags
Focused pytest result:
2 passed, 70 deselected in 0.18s
Backwards Compatibility
Fully backwards compatible by default. Existing metric labels remain unchanged unless users explicitly set RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS=0.